Abstract¶

In the initial phase of this analysis, I prioritize meticulous data preparation and thorough examination, commonly referred to as data preprocessing and exploratory data analysis (EDA). This essential step serves as the cornerstone of the entire study, with the aim of unraveling insights and identifying anomalies within the dataset. Subsequently, a combination of unsupervised and supervised machine learning techniques is employed to extract patterns, categorize tumors, and forecast numerical features. This multifaceted approach maximizes the utilization of the dataset's richness, facilitating the derivation of actionable conclusions. Moreover, a strong emphasis is placed on ensuring clarity and transparency throughout the methodology, with a focus on reproducibility. Additionally, ethical implications, particularly in the realm of healthcare applications of machine learning, are conscientiously considered.

1.1 Introduction¶

Pre-processing the dataset and conducting Exploratory Data Analysis (EDA) are vital for understanding data structure and preparing it for analysis. This phase includes handling missing, duplicated, or outlier values to ensure data integrity. Transforming data may be necessary for normalization or scaling, while categorical features are encoded numerically. Splitting the dataset into training and test sets facilitates model evaluation. Feature engineering techniques, like extraction and selection, enhance predictive power. Informative plots and tables visualize distributions, correlations, and trends. Statistical assumptions are assessed to validate analytical approaches. By pre-processing and conducting EDA, analysts gain insights into the dataset, enabling informed decision-making and robust model development.

1.2Methodology¶

1.2.1 Load the dataset and remove any missing values and duplicated rows.¶

Load the dataset using the file path of the dataset. Missing values in a dataset can introduce bias or errors during analysis, as algorithms may struggle to handle them effectively, resulting in suboptimal performance. Additionally, duplicate rows may indicate errors in data collection, potentially compromising the integrity of the dataset. Ensuring data integrity is paramount, particularly because duplicate rows can contribute to overfitting, thereby impeding models' ability to generalize to new data. Therefore, it is crucial to address both missing values and duplicates through appropriate data cleaning processes to uphold the reliability of analyses and models.

Dealing with missing values and duplicates is crucial for effective data cleaning and preparation, significantly influencing the quality and reliability of subsequent analyses or machine learning models. To uphold data integrity, systematically removing missing values and duplicate entries is essential. Neglecting or mishandling these issues may result in biased results, diminished model performance, and unreliable insights. Thus, giving meticulous attention to removing missing values and duplicates is vital for improving the strength and accuracy of data-driven analyses, ensuring the validity of conclusions drawn from the dataset.

In [55]:
import pandas as pd

# Path of the dataset
file_path = 'D:/Karthika University/MS4S16MachineLearning/Assignment/MS4S16_Dataset.csv'

# Load the dataset using pandas(Data set name data_set)
data_set = pd.read_csv(file_path)

# Find the missing valu
missing_values = data_set.isnull().sum()
print("Missing Values:\n", missing_values)
# remove missingvalues
data_set = data_set.dropna()

# Find the duplicated rows
duplicated_rows = data_set[data_set.duplicated()]
print("duplicat_rows:\n", duplicated_rows)

# Duplicated rows remove from data set
dataset_cleaned = data_set.drop_duplicates()
print(f"Number of duplicated rows: {len(duplicated_rows)}")
print(f"Shape of the original DataFrame: {data_set.shape}")
print(f"Shape of the DataFrame after removing duplicates: {dataset_cleaned.shape}")

# Cleanned dataset name is data
data = pd.DataFrame(dataset_cleaned)
print(data)
Missing Values:
 id                          3
diagnosis                   3
radius_mean                 5
texture_mean                6
perimeter_mean              4
area_mean                   5
smoothness_mean             3
compactness_mean            4
concavity_mean              4
concave points_mean         8
symmetry_mean               3
fractal_dimension_mean      4
radius_se                   6
texture_se                  8
perimeter_se                3
area_se                     6
smoothness_se               6
compactness_se              7
concavity_se                8
concave points_se           9
symmetry_se                 8
fractal_dimension_se        7
radius_worst               13
texture_worst              21
perimeter_worst             6
area_worst                  4
smoothness_worst            9
compactness_worst           4
concavity_worst             3
concave points_worst        6
symmetry_worst              4
fractal_dimension_worst    13
dtype: int64
duplicat_rows:
            id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
493  914062.0         M        18.01       -999.00          118.40     1007.0   
570   92751.0         B         7.76         24.54           47.92      181.0   

     smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
493          0.10010           0.12890           0.117              0.07762   
570          0.05263           0.04362           0.000              0.00000   

     ...  radius_worst  texture_worst  perimeter_worst  area_worst  \
493  ...        21.530          26.06           143.40      1426.0   
570  ...         9.456          30.37            59.16       268.6   

     smoothness_worst  compactness_worst  concavity_worst  \
493           0.13090            0.23270           0.2544   
570           0.08996            0.06444           0.0000   

     concave points_worst  symmetry_worst  fractal_dimension_worst  
493                0.1489          0.3251                  0.07625  
570                0.0000          0.2871                  0.07039  

[2 rows x 32 columns]
Number of duplicated rows: 2
Shape of the original DataFrame: (482, 32)
Shape of the DataFrame after removing duplicates: (480, 32)
             id diagnosis  radius_mean  texture_mean  perimeter_mean  \
0      842302.0         M        17.99         10.38          122.80   
1      842517.0         M        20.57         17.77          132.90   
2    84300903.0         M        19.69         21.25          130.00   
3    84348301.0         M        11.42         20.38           77.58   
4    84358402.0         M        20.29         14.34          135.10   
..          ...       ...          ...           ...             ...   
564    926125.0         M        20.92         25.09          143.00   
565    926424.0         M        21.56         22.39          142.00   
567    926954.0         M        16.60         28.08          108.30   
568    927241.0         M        20.60         29.33          140.10   
569     92751.0         B         7.76         24.54           47.92   

     area_mean  smoothness_mean  compactness_mean  concavity_mean  \
0       1001.0          0.11840           0.27760         0.30010   
1       1326.0          0.08474           0.07864         0.08690   
2       1203.0          0.10960           0.15990         0.19740   
3        386.1          0.14250           0.28390         0.24140   
4       1297.0          0.10030           0.13280         0.19800   
..         ...              ...               ...             ...   
564     1347.0          0.10990           0.22360         0.31740   
565     1479.0          0.11100           0.11590         0.24390   
567      858.1          0.08455           0.10230         0.09251   
568     1265.0          0.11780           0.27700         0.35140   
569      181.0          0.05263           0.04362         0.00000   

     concave points_mean  ...  radius_worst  texture_worst  perimeter_worst  \
0                0.14710  ...        25.380          17.33           184.60   
1                0.07017  ...        24.990          23.41           158.80   
2                0.12790  ...        23.570          25.53           152.50   
3                0.10520  ...        14.910          26.50            98.87   
4                0.10430  ...        22.540          16.67           152.20   
..                   ...  ...           ...            ...              ...   
564              0.14740  ...        24.290          29.41           179.10   
565              0.13890  ...        25.450          26.40           166.10   
567              0.05302  ...        18.980          34.12           126.70   
568              0.15200  ...        25.740          39.42           184.60   
569              0.00000  ...         9.456          30.37            59.16   

     area_worst  smoothness_worst  compactness_worst  concavity_worst  \
0        2019.0           0.16220            0.66560           0.7119   
1        1956.0           0.12380            0.18660           0.2416   
2        1709.0           0.14440            0.42450           0.4504   
3         567.7           0.20980            0.86630           0.6869   
4        1575.0           0.13740            0.20500           0.4000   
..          ...               ...                ...              ...   
564      1819.0           0.14070            0.41860           0.6599   
565      2027.0           0.14100            0.21130           0.4107   
567      1124.0           0.11390            0.30940           0.3403   
568      1821.0           0.16500            0.86810           0.9387   
569       268.6           0.08996            0.06444           0.0000   

     concave points_worst  symmetry_worst  fractal_dimension_worst  
0                  0.2654          0.4601                  0.11890  
1                  0.1860          0.2750                  0.08902  
2                  0.2430          0.3613                  0.08758  
3                  0.2575          0.6638                  0.17300  
4                  0.1625          0.2364                  0.07678  
..                    ...             ...                      ...  
564                0.2542          0.2929                  0.09873  
565                0.2216          0.2060                  0.07115  
567                0.1418          0.2218                  0.07820  
568                0.2650          0.4087                  0.12400  
569                0.0000          0.2871                  0.07039  

[480 rows x 32 columns]

1.2.2 Test the normality of Data set¶

Not all columns in this dataset display a normal distribution, which can influence statistical assumptions. In such cases, it's recommended to adopt analytical approaches tailored for non-normal distributions. When using machine learning models or conducting statistical analyses, awareness of non-normal distribution is crucial. Consider employing techniques robust to deviations from normality, like non-parametric methods or transformations, to ensure accurate and reliable results.

In [3]:
import pandas as pd
import numpy as np
from scipy.stats import shapiro
import matplotlib.pyplot as plt

# Assuming original data loaded into a DataFrame named 'original_data'
original_data = data

# Create a function to test normality and generate histograms
def test_normality_and_plot(df_column):
    # Convert the column to a numeric type, ignoring non-numeric values
    cleaned_data = pd.to_numeric(df_column, errors='coerce').dropna()

    # Check if there are at least three non-missing values
    if len(cleaned_data) < 3:
        print(f"{df_column.name}: Insufficient data for normality test (less than 3 values).\n")
        return

    # Perform the Shapiro-Wilk test
    stat, p_value = shapiro(cleaned_data)

    # Plot histogram
    plt.figure(figsize=(8, 6))
    plt.hist(cleaned_data, bins='auto', color='blue', edgecolor='black')
    plt.title(f'Histogram of {df_column.name}')
    plt.xlabel('Values')
    plt.ylabel('Frequency')
    plt.show()

    # Print test results
    print(f"{df_column.name}:")
    print(f"Shapiro-Wilk Test - p-value: {p_value}")
    if p_value > 0.05:
        print("Data appears to be normally distributed.\n")
    else:
        print("Data does not appear to be normally distributed.\n")

# Iterate through the columns using a for loop
for column in original_data.columns:
    test_normality_and_plot(original_data[column])
id:
Shapiro-Wilk Test - p-value: 2.04256066756914e-40
Data does not appear to be normally distributed.

diagnosis: Insufficient data for normality test (less than 3 values).

radius_mean:
Shapiro-Wilk Test - p-value: 9.995522747170693e-14
Data does not appear to be normally distributed.

texture_mean:
Shapiro-Wilk Test - p-value: 3.733706543469508e-33
Data does not appear to be normally distributed.

perimeter_mean:
Shapiro-Wilk Test - p-value: 2.5340305212246533e-14
Data does not appear to be normally distributed.

area_mean:
Shapiro-Wilk Test - p-value: 4.219632481566258e-21
Data does not appear to be normally distributed.

smoothness_mean:
Shapiro-Wilk Test - p-value: 0.021688221022486687
Data does not appear to be normally distributed.

compactness_mean:
Shapiro-Wilk Test - p-value: 6.089557027036396e-16
Data does not appear to be normally distributed.

concavity_mean:
Shapiro-Wilk Test - p-value: 7.982317468061152e-20
Data does not appear to be normally distributed.

concave points_mean:
Shapiro-Wilk Test - p-value: 9.80908925027372e-44
Data does not appear to be normally distributed.

symmetry_mean:
Shapiro-Wilk Test - p-value: 1.984799144869671e-41
Data does not appear to be normally distributed.

fractal_dimension_mean:
Shapiro-Wilk Test - p-value: 1.2990036764291054e-42
Data does not appear to be normally distributed.

radius_se:
Shapiro-Wilk Test - p-value: 5.4718801282130944e-27
Data does not appear to be normally distributed.

texture_se:
Shapiro-Wilk Test - p-value: 1.1859550658426937e-17
Data does not appear to be normally distributed.

perimeter_se:
Shapiro-Wilk Test - p-value: 3.284580112550128e-28
Data does not appear to be normally distributed.

area_se:
Shapiro-Wilk Test - p-value: 2.9167252963495484e-33
Data does not appear to be normally distributed.

smoothness_se:
Shapiro-Wilk Test - p-value: 2.468378084179224e-22
Data does not appear to be normally distributed.

compactness_se:
Shapiro-Wilk Test - p-value: 2.3120763398623534e-22
Data does not appear to be normally distributed.

concavity_se:
Shapiro-Wilk Test - p-value: 4.706028250861632e-30
Data does not appear to be normally distributed.

concave points_se:
Shapiro-Wilk Test - p-value: 8.647694995324502e-16
Data does not appear to be normally distributed.

symmetry_se:
Shapiro-Wilk Test - p-value: 9.280427270639034e-23
Data does not appear to be normally distributed.

fractal_dimension_se:
Shapiro-Wilk Test - p-value: 1.6255062186167878e-43
Data does not appear to be normally distributed.

radius_worst:
Shapiro-Wilk Test - p-value: 1.0698616399906931e-16
Data does not appear to be normally distributed.

texture_worst:
Shapiro-Wilk Test - p-value: 1.3853089512849692e-05
Data does not appear to be normally distributed.

perimeter_worst:
Shapiro-Wilk Test - p-value: 1.8719321649731496e-34
Data does not appear to be normally distributed.

area_worst:
Shapiro-Wilk Test - p-value: 1.2546626015983025e-23
Data does not appear to be normally distributed.

smoothness_worst:
Shapiro-Wilk Test - p-value: 0.0018222584621980786
Data does not appear to be normally distributed.

compactness_worst:
Shapiro-Wilk Test - p-value: 1.536926724414147e-17
Data does not appear to be normally distributed.

concavity_worst:
Shapiro-Wilk Test - p-value: 8.437772914182337e-15
Data does not appear to be normally distributed.

concave points_worst:
Shapiro-Wilk Test - p-value: 4.832046762714981e-09
Data does not appear to be normally distributed.

symmetry_worst:
Shapiro-Wilk Test - p-value: 2.710039033697179e-16
Data does not appear to be normally distributed.

fractal_dimension_worst:
Shapiro-Wilk Test - p-value: 3.0740765890703364e-16
Data does not appear to be normally distributed.

1.2.3 Encoded data and handle outliers¶

When dealing with a dataset that is not normally distributed, addressing outliers becomes crucial for maintaining the integrity of statistical analyses and machine learning models. Outliers, or extreme values, can significantly influence summary statistics and model performance. One common method for identifying and handling outliers is the Interquartile Range (IQR) method.

In cases where the dataset includes categorical variables, like the 'diagnosis' column in this scenario, directly applying the IQR method might not be feasible. To overcome this limitation, encoding the categorical data is necessary, converting it into a numerical format that allows for the application of outlier detection techniques.

acknowledging and managing outliers is essential for robust data analysis. The choice of an appropriate method, such as IQR, is contingent on the distribution of the data and the nature of the variables involved. The encoding of categorical variables facilitates outlier detection and removal, ensuring a more reliable and accurate analysis.

In [56]:
 # Encoded data
encoded_data = data

# Assuming 'diagnosis' is the categorical column
diagnosis_encoded = pd.get_dummies(encoded_data['diagnosis'], prefix='diagnosis', drop_first=True)

# Drop the original 'diagnosis' column and concatenate the one-hot encoded columns
encoded_data = pd.concat([encoded_data.drop('diagnosis', axis=1), diagnosis_encoded], axis=1)

# Display the encoded dataset
print(encoded_data)
             id  radius_mean  texture_mean  perimeter_mean  area_mean  \
0      842302.0        17.99         10.38          122.80     1001.0   
1      842517.0        20.57         17.77          132.90     1326.0   
2    84300903.0        19.69         21.25          130.00     1203.0   
3    84348301.0        11.42         20.38           77.58      386.1   
4    84358402.0        20.29         14.34          135.10     1297.0   
..          ...          ...           ...             ...        ...   
564    926125.0        20.92         25.09          143.00     1347.0   
565    926424.0        21.56         22.39          142.00     1479.0   
567    926954.0        16.60         28.08          108.30      858.1   
568    927241.0        20.60         29.33          140.10     1265.0   
569     92751.0         7.76         24.54           47.92      181.0   

     smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
0            0.11840           0.27760         0.30010              0.14710   
1            0.08474           0.07864         0.08690              0.07017   
2            0.10960           0.15990         0.19740              0.12790   
3            0.14250           0.28390         0.24140              0.10520   
4            0.10030           0.13280         0.19800              0.10430   
..               ...               ...             ...                  ...   
564          0.10990           0.22360         0.31740              0.14740   
565          0.11100           0.11590         0.24390              0.13890   
567          0.08455           0.10230         0.09251              0.05302   
568          0.11780           0.27700         0.35140              0.15200   
569          0.05263           0.04362         0.00000              0.00000   

     symmetry_mean  ...  texture_worst  perimeter_worst  area_worst  \
0           0.2419  ...          17.33           184.60      2019.0   
1           0.1812  ...          23.41           158.80      1956.0   
2           0.2069  ...          25.53           152.50      1709.0   
3           0.2597  ...          26.50            98.87       567.7   
4           0.1809  ...          16.67           152.20      1575.0   
..             ...  ...            ...              ...         ...   
564         2.1000  ...          29.41           179.10      1819.0   
565         0.1726  ...          26.40           166.10      2027.0   
567         0.1590  ...          34.12           126.70      1124.0   
568         0.2397  ...          39.42           184.60      1821.0   
569         0.1587  ...          30.37            59.16       268.6   

     smoothness_worst  compactness_worst  concavity_worst  \
0             0.16220            0.66560           0.7119   
1             0.12380            0.18660           0.2416   
2             0.14440            0.42450           0.4504   
3             0.20980            0.86630           0.6869   
4             0.13740            0.20500           0.4000   
..                ...                ...              ...   
564           0.14070            0.41860           0.6599   
565           0.14100            0.21130           0.4107   
567           0.11390            0.30940           0.3403   
568           0.16500            0.86810           0.9387   
569           0.08996            0.06444           0.0000   

     concave points_worst  symmetry_worst  fractal_dimension_worst  \
0                  0.2654          0.4601                  0.11890   
1                  0.1860          0.2750                  0.08902   
2                  0.2430          0.3613                  0.08758   
3                  0.2575          0.6638                  0.17300   
4                  0.1625          0.2364                  0.07678   
..                    ...             ...                      ...   
564                0.2542          0.2929                  0.09873   
565                0.2216          0.2060                  0.07115   
567                0.1418          0.2218                  0.07820   
568                0.2650          0.4087                  0.12400   
569                0.0000          0.2871                  0.07039   

     diagnosis_M  
0           True  
1           True  
2           True  
3           True  
4           True  
..           ...  
564         True  
565         True  
567         True  
568         True  
569        False  

[480 rows x 32 columns]
In [57]:
original_data = encoded_data.copy()

# Create a function to detect and handle outliers using IQR method
def handle_outliers_using_iqr(df_column):
    # Check if the column contains numeric values
    if pd.api.types.is_numeric_dtype(df_column):
        # Convert the column to a numeric type, ignoring non-numeric values
        cleaned_data = pd.to_numeric(df_column, errors='coerce')

        # Drop NaN values
        cleaned_data = cleaned_data.dropna()

        if not cleaned_data.empty:
            # Calculate the first and third quartiles
            q1 = cleaned_data.quantile(0.25)
            q3 = cleaned_data.quantile(0.75)

            # Calculate the IQR (Interquartile Range)
            iqr_value = iqr(cleaned_data)

            # Define the lower and upper bounds for outliers
            lower_bound = q1 - 1.5 * iqr_value
            upper_bound = q3 + 1.5 * iqr_value

            # Identify and handle outliers
            outliers = (cleaned_data < lower_bound) | (cleaned_data > upper_bound)
            cleaned_data[outliers] = np.nan  # Set outliers to NaN

    else:
        cleaned_data = df_column

    return cleaned_data

# Apply the function to each column in the original_data DataFrame
cleaned_data = original_data.applymap(handle_outliers_using_iqr)

# Display the cleaned dataset
print(cleaned_data)
             id  radius_mean  texture_mean  perimeter_mean  area_mean  \
0      842302.0        17.99         10.38          122.80     1001.0   
1      842517.0        20.57         17.77          132.90     1326.0   
2    84300903.0        19.69         21.25          130.00     1203.0   
3    84348301.0        11.42         20.38           77.58      386.1   
4    84358402.0        20.29         14.34          135.10     1297.0   
..          ...          ...           ...             ...        ...   
564    926125.0        20.92         25.09          143.00     1347.0   
565    926424.0        21.56         22.39          142.00     1479.0   
567    926954.0        16.60         28.08          108.30      858.1   
568    927241.0        20.60         29.33          140.10     1265.0   
569     92751.0         7.76         24.54           47.92      181.0   

     smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
0            0.11840           0.27760         0.30010              0.14710   
1            0.08474           0.07864         0.08690              0.07017   
2            0.10960           0.15990         0.19740              0.12790   
3            0.14250           0.28390         0.24140              0.10520   
4            0.10030           0.13280         0.19800              0.10430   
..               ...               ...             ...                  ...   
564          0.10990           0.22360         0.31740              0.14740   
565          0.11100           0.11590         0.24390              0.13890   
567          0.08455           0.10230         0.09251              0.05302   
568          0.11780           0.27700         0.35140              0.15200   
569          0.05263           0.04362         0.00000              0.00000   

     symmetry_mean  ...  texture_worst  perimeter_worst  area_worst  \
0           0.2419  ...          17.33           184.60      2019.0   
1           0.1812  ...          23.41           158.80      1956.0   
2           0.2069  ...          25.53           152.50      1709.0   
3           0.2597  ...          26.50            98.87       567.7   
4           0.1809  ...          16.67           152.20      1575.0   
..             ...  ...            ...              ...         ...   
564         2.1000  ...          29.41           179.10      1819.0   
565         0.1726  ...          26.40           166.10      2027.0   
567         0.1590  ...          34.12           126.70      1124.0   
568         0.2397  ...          39.42           184.60      1821.0   
569         0.1587  ...          30.37            59.16       268.6   

     smoothness_worst  compactness_worst  concavity_worst  \
0             0.16220            0.66560           0.7119   
1             0.12380            0.18660           0.2416   
2             0.14440            0.42450           0.4504   
3             0.20980            0.86630           0.6869   
4             0.13740            0.20500           0.4000   
..                ...                ...              ...   
564           0.14070            0.41860           0.6599   
565           0.14100            0.21130           0.4107   
567           0.11390            0.30940           0.3403   
568           0.16500            0.86810           0.9387   
569           0.08996            0.06444           0.0000   

     concave points_worst  symmetry_worst  fractal_dimension_worst  \
0                  0.2654          0.4601                  0.11890   
1                  0.1860          0.2750                  0.08902   
2                  0.2430          0.3613                  0.08758   
3                  0.2575          0.6638                  0.17300   
4                  0.1625          0.2364                  0.07678   
..                    ...             ...                      ...   
564                0.2542          0.2929                  0.09873   
565                0.2216          0.2060                  0.07115   
567                0.1418          0.2218                  0.07820   
568                0.2650          0.4087                  0.12400   
569                0.0000          0.2871                  0.07039   

     diagnosis_M  
0           True  
1           True  
2           True  
3           True  
4           True  
..           ...  
564         True  
565         True  
567         True  
568         True  
569        False  

[480 rows x 32 columns]

1.2.4 Split the dataset¶

Splitting the dataset into training and test sets is essential for accurately assessing a model's performance, preventing overfitting, tuning hyperparameters, avoiding data leakage, and selecting the best model for a given task. In this standard practice, typically, the training set comprises 80% of the data, while the test set comprises the remaining 20%. This division allows for robust evaluation of the model's generalization ability while ensuring adequate training data for model learning.

In [58]:
from sklearn.model_selection import train_test_split
# Split the dataset
X = cleaned_data.drop(['id', 'diagnosis_M'], axis=1)
y = cleaned_data['diagnosis_M']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

1.2.5 StandardScaler¶

The StandardScaler is a versatile preprocessing technique that offers several advantages, including improved convergence, equal contribution of features, stabilized variance, simplified interpretation, facilitated regularization, improved algorithm performance, and handling of outliers.

In [59]:
from sklearn.preprocessing import StandardScaler

# Create a StandardScaler instance
scaler = StandardScaler()

# Fit and transform the training data  using the Standard scaler
X_train_scaled = scaler.fit_transform(X_train)

# Transform the testing data using the Standard scaler
X_test_scaled = scaler.transform(X_test)

1.2.6 The square root transform¶

The given dataset exhibits a non-normal distribution and contains both zero and negative values, which limits the options for data transformation. The Log Transform and The Box-Cox Transform are unsuitable in this scenario due to the presence of zero and negative values. However, The Root Transform, specifically the square root transform, remains a viable option. This transformation technique is widely used in statistics and data analysis to address skewed data or mitigate outlier effects. By applying the square root to each observation, the data distribution tends to become more symmetric, facilitating certain statistical analyses. Moreover, the square root transform has the potential to stabilize the variance of the data, thus ensuring consistency across different levels of the independent variable. This feature is particularly advantageous in regression analysis scenarios.

In [60]:
import numpy as np
# Transform the training data and testing data  using the square root
X_train_sqrt_transformed = np.sqrt(X_train_scaled)
X_test_sqrt_transformed = np.sqrt(X_test_scaled)
C:\Users\User\AppData\Local\Temp\ipykernel_10812\3660086394.py:3: RuntimeWarning: invalid value encountered in sqrt
  X_train_sqrt_transformed = np.sqrt(X_train_scaled)
C:\Users\User\AppData\Local\Temp\ipykernel_10812\3660086394.py:4: RuntimeWarning: invalid value encountered in sqrt
  X_test_sqrt_transformed = np.sqrt(X_test_scaled)

1.2.7 Explanation of Diagnosis Distribution (Malignant vs. Benign)¶

The dataset focuses on cancer diagnosis, where "diagnosis" identifies tumors as malignant (M) or benign (B). A distribution graph displays the frequency of each diagnosis type, with benign cases outnumbering malignant ones. This imbalance poses challenges, potentially skewing analysis and model predictions. Reasons for the disparity may include tumor occurrence rates or data biases. To ensure accurate insights, addressing this imbalance through techniques like resampling or alternative modeling approaches is crucial. Understanding and mitigating class imbalance are essential for reliable cancer diagnosis insights, shaping subsequent analysis steps.

In [39]:
import numpy as np
from scipy.stats import shapiro
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4))
sns.countplot(x='diagnosis_M', data=cleaned_data)
plt.title('Distribution of Diagnosis (Malignant vs. benign)')
plt.show()

1.2.8 Correlation matrix of the dataset¶

This figure depicts the correlation matrix of the dataset, showcasing the relationships between different columns. High correlation values between columns indicate strong linear relationships, suggesting that changes in one variable coincide with changes in another. The presence of highly correlated columns implies redundancy or multicollinearity within the dataset. In other words, some features may convey similar information, potentially affecting the performance of machine learning models by introducing noise or instability. Addressing multicollinearity through feature selection or dimensionality reduction techniques can enhance model interpretability and predictive accuracy.

In [37]:
# Correlation matrix
correlation_matrix = cleaned_data.corr()

# Heatmap visualization of correlation matrix
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

# Pair plot for selected numerical variables
sns.pairplot(cleaned_data[['radius_mean', 'texture_mean', 'perimeter_mean', 'diagnosis_M']], hue='diagnosis_M')
plt.title('Pair Plot')
plt.show()
In [38]:
sns.pairplot(cleaned_data, hue='diagnosis_M', markers=['o', 's'])
plt.suptitle('Pairwise Scatter Plots with Target Variable')
plt.show()

1.2.9 Removing highly correlated columns¶

Removing highly correlated columns is a form of feature extraction that eliminates redundant information, enhancing the dataset's relevance and predictive power. By streamlining the dataset, this process optimizes model performance by reducing overfitting and improving interpretability. Ultimately, it aims to distill the dataset to its most salient attributes, enhancing its utility for accurate and efficient machine learning

In [8]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

# Impute missing values using SimpleImputer for X_train
imputer = SimpleImputer(strategy='mean')
X_train_imputed_array = imputer.fit_transform(X_train_scaled)
X_test_imputed_array = imputer.fit_transform(X_test_scaled)

# Ensure X_train_log_transformed is a DataFrame
X_train_scaled = pd.DataFrame(X_train_scaled)
X_test_scaled = pd.DataFrame(X_test_scaled)

# Convert NumPy array to a Pandas DataFrame with column names for X_train
X_train_imputed_df = pd.DataFrame(X_train_imputed_array, columns=X_train_scaled.columns)
X_test_imputed_df = pd.DataFrame(X_test_imputed_array, columns=X_test_scaled.columns)

# Concatenate X_train and y_train for correlation analysis
train_data = pd.concat([X_train_imputed_df, y_train], axis=1)

# Calculate the correlation matrix for X_train
correlation_matrix = train_data.corr()

# Plot a heatmap of the correlation matrix for X_train
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix Heatmap (X_train)')
plt.show()

# Drop highly correlated features (adjust the threshold as needed) for X_train
correlation_threshold = 0.8
upper_tri = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1) == 1)
to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > correlation_threshold)]
X_train_no_corr = X_train_imputed_df.drop(columns=to_drop)
X_test_no_corr = X_test_imputed_df.drop(columns=to_drop)

# Concatenate X_train_no_corr and y_train for correlation analysis
train_data_no_corr = pd.concat([X_train_no_corr, y_train], axis=1)

# Calculate the correlation matrix for X_train_no_corr
correlation_matrix_no_corr = train_data_no_corr.corr()

# Plot a heatmap of the correlation matrix after removing columns for X_train_no_corr
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix_no_corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('After_remove Correlation Matrix Heatmap (X_train_no_corr)')
plt.show()
In [61]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
import numpy as np
import pandas as pd

# Impute missing values using SimpleImputer
imputer = SimpleImputer(strategy='mean')  # You can choose other strategies like 'median' or 'most_frequent'
X_train_imputed_array_sqrt = imputer.fit_transform(X_train_sqrt_transformed)
X_test_imputed_array_sqrt = imputer.transform(X_test_sqrt_transformed)

# Ensure X_train_log_transformed is a DataFrame
X_train_sqrt_transformed = pd.DataFrame(X_train_sqrt_transformed)
X_test_sqrt_transformed = pd.DataFrame(X_test_sqrt_transformed)

# Convert NumPy arrays to Pandas DataFrames with column names
X_train_imputed_sqrt_df = pd.DataFrame(X_train_imputed_array_sqrt, columns=X_train_sqrt_transformed.columns)
X_test_imputed_sqrt_df = pd.DataFrame(X_test_imputed_array_sqrt, columns=X_test_sqrt_transformed.columns)

# Concatenate X_train and y_train for correlation analysis
train_data_sqrt = pd.concat([X_train_imputed_sqrt_df, y_train], axis=1)

# Calculate the correlation matrix
correlation_matrix_sqrt = train_data_sqrt.corr()

# Plot a heatmap of the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix_sqrt, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix_sqrt Heatmap')
plt.show()

# Drop highly correlated features (adjust the threshold as needed)
correlation_threshold = 0.8
upper_tri = correlation_matrix_sqrt.where(np.triu(np.ones(correlation_matrix_sqrt.shape), k=1) == 1)
to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > correlation_threshold)]

X_train_no_corr_sqrt = X_train_imputed_sqrt_df.drop(columns=to_drop)
X_test_no_corr_sqrt = X_test_imputed_sqrt_df.drop(columns=to_drop)

# Plot a heatmap of the correlation matrix after remove column
# Concatenate X_train and y_train for correlation analysis
train_data_no_corr_sqrt = pd.concat([X_train_no_corr_sqrt, y_train], axis=1)

# Calculate the correlation matrix
correlation_matrix_sqrt = train_data_no_corr_sqrt.corr()

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix_sqrt, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('After_remove Correlation Matri_sqrtx Heatmap')
plt.show()

1.3 Conclusion¶

the Exploratory Data Analysis (EDA) provides valuable insights into the dataset's characteristics and relationships between variables. Through informative plots and statistical summaries, we have gained a deeper understanding of the distribution, patterns, and correlations within the data. The analysis revealed significant class imbalance in the diagnosis variable, emphasizing the need for careful consideration and mitigation strategies in subsequent modeling efforts. Furthermore, the correlation matrix highlighted potential multicollinearity issues, suggesting avenues for feature selection or dimensionality reduction. Overall, the EDA process has laid a solid foundation for further analysis and modeling, guiding us towards more informed decision-making and actionable insights in cancer diagnosis and treatment contexts.

2.1 Introduction¶

In the realm of data analysis, leveraging meticulously pre-processed features and attributes from Exploratory Data Analysis (EDA) serves as the cornerstone for comprehensive insights. This phase initiates an unsupervised machine learning analysis, aimed at uncovering deeper nuances in the dataset through clustering or dimensionality reduction techniques. The overarching goal is to elucidate underlying structures and patterns, unimpeded by the constraints of labeled data. Various clustering algorithms, including K-means, hierarchical, and DBScan, are considered to unveil inherent data groupings. Additionally, dimensionality reduction explores if a streamlined feature set effectively captures observations. Critical to this analysis is the meticulous evaluation of algorithms, using appropriate metrics to gauge efficacy. This methodical approach enriches understanding, facilitates informed decision-making, and sets the stage for further dataset exploration.

2.2Methodology¶

2.2.1Determine the optimal number of clusters using the silhouette score¶

The code for preprocessing the First Dataset, the code for preprocessing the Second Dataset, and the code for preprocessing the Third Dataset perform the following steps.

  • Imputing Missing Values: It imputes missing values in the training and test datasets using the mean strategy through SimpleImputer.

  • Dimensionality Reduction with PCA: The code employs PCA (Principal Component Analysis) for dimensionality reduction. PCA is applied to both the training and test datasets. After applying PCA to the dataset, the algorithm determines the optimal number of principal components to retain based on the specified criterion, which is to retain 95% of the variance in the data. The ncomponents attribute of the PCA object stores the actual number of components selected by PCA. This PCA transformation aids in comprehending the amount of information retained in the reduced-dimensional space, offering insights into the impact of PCA on the dataset and facilitating the interpretation of subsequent analyses or modeling steps.

  • Silhouette Score Calculation: It then iterates over different numbers of clusters ranging from 2 to 10. For each number of clusters, it applies KMeans clustering to the PCA-transformed training data and computes the silhouette score, which measures how similar an object is to its own cluster compared to other clusters.

  • Finding Optimal Number of Clusters: After calculating silhouette scores for different numbers of clusters, it identifies the number of clusters that maximizes the silhouette score.

  • Visualization: Finally, it visualizes the silhouette scores for different numbers of clusters with a line plot. It marks the optimal number of clusters with a vertical dashed line.

I followed these steps using three different datasets:

  • The first dataset, consisting of X_train_no_corr_sqrt and X_test_no_corr_sqrt, originated from the original dataset. I applied StandardScaler transformation followed by square root transformation, and then removed highly correlated columns.

  • The second dataset, containing X_train_no_corr and X_test_no_corr, also originated from the original dataset. I applied StandardScaler transformation and then removed highly correlated columns.

  • The third dataset, comprising X_train_scaled and X_test_scaled, was derived from the original dataset. I applied StandardScaler transformation to this dataset.

2.2.1.1 Preprocessing Steps for the First Dataset¶

In [62]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Impute missing values using SimpleImputer
imputer = SimpleImputer(strategy='mean') 
X_train_imputed_no_corr_sqrt = imputer.fit_transform(X_train_no_corr_sqrt )
X_test_imputed_no_corr_sqrt = imputer.transform(X_test_no_corr_sqrt )

pca = PCA(n_components=0.95)  
X_train_pca_no_corr_sqrt = pca.fit_transform(X_train_imputed_no_corr_sqrt)
X_test_pca_no_corr_sqrt = pca.transform(X_test_imputed_no_corr_sqrt)

# Check the number of components selected by PCA
print(f'Number of components selected by PCA: {pca.n_components_}')


# Experiment with different numbers of clusters using PCA-transformed data
cluster_range = range(2, 11)

silhouette_scores_pca_no_corr_sqrt = []

for n_clusters in cluster_range:
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    clusters = kmeans.fit_predict(X_train_pca_no_corr_sqrt)
    
    silhouette_avg = silhouette_score(X_train_pca_no_corr_sqrt, clusters)
    silhouette_scores_pca_no_corr_sqrt.append(silhouette_avg)

# Find the optimal number of clusters using PCA-transformed data
optimal_clusters_pca_no_corr_sqrt = cluster_range[np.argmax(silhouette_scores_pca_no_corr_sqrt)]

# Visualize the silhouette scores for different cluster numbers with PCA
plt.figure(figsize=(10, 6))
plt.plot(cluster_range, silhouette_scores_pca_no_corr_sqrt, marker='o')
plt.title('Silhouette Score vs. Number of Clusters (with PCA)')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.axvline(x=optimal_clusters_pca_no_corr_sqrt, color='red', linestyle='--', label=f'Optimal Clusters: {optimal_clusters_pca_no_corr_sqrt}')
plt.legend()
plt.show()

print(f'The optimal number of clusters with PCA is: {optimal_clusters_pca_no_corr_sqrt}')
Number of components selected by PCA: 20
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
The optimal number of clusters with PCA is: 2

2.2.1.2 Preprocessing Steps for the Second Dataset¶

In [11]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Impute missing values using SimpleImputer
imputer = SimpleImputer(strategy='mean')  # You can choose other strategies like 'median' or 'most_frequent'
X_train_no_corr_imputed = imputer.fit_transform(X_train_no_corr)
X_test_no_corr_imputed = imputer.transform(X_test_no_corr)

pca = PCA(n_components=0.95)  # Choose the number of components to retain a certain percentage of variance
X_train_no_corr_pca = pca.fit_transform(X_train_no_corr_imputed)
X_test_no_corr_pca = pca.transform(X_test_no_corr_imputed)

# Check the number of components selected by PCA
print(f'Number of components selected by PCA: {pca.n_components_}')


# Experiment with different numbers of clusters using PCA-transformed data
cluster_range = range(2, 11)

silhouette_scores_no_corr_pca = []

for n_clusters in cluster_range:
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    clusters = kmeans.fit_predict(X_train_no_corr_pca)
    
    silhouette_avg = silhouette_score(X_train_no_corr_pca, clusters)
    silhouette_scores_no_corr_pca.append(silhouette_avg)

# Find the optimal number of clusters using PCA-transformed data
optimal_clusters_no_corr_pca = cluster_range[np.argmax(silhouette_scores_no_corr_pca)]

# Visualize the silhouette scores for different cluster numbers with PCA
plt.figure(figsize=(10, 6))
plt.plot(cluster_range, silhouette_scores_no_corr_pca, marker='o')
plt.title('Silhouette Score vs. Number of Clusters (with PCA)')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.axvline(x=optimal_clusters_no_corr_pca, color='red', linestyle='--', label=f'Optimal Clusters: {optimal_clusters_no_corr_pca}')
plt.legend()
plt.show()

print(f'The optimal number of clusters with PCA is: {optimal_clusters_no_corr_pca}')
Number of components selected by PCA: 14
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
The optimal number of clusters with PCA is: 2

2.2.1.3 Preprocessing Steps for the third Dataset¶

In [12]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


pca = PCA(n_components=0.95)  
X_train_scaled_pca = pca.fit_transform(X_train_scaled)
X_test_scaled_pca = pca.transform(X_test_scaled)

# Check the number of components selected by PCA
print(f'Number of components selected by PCA: {pca.n_components_}')


# Experiment with different numbers of clusters using PCA-transformed data
cluster_range = range(2, 11)

silhouette_scores_scaled_pca = []

for n_clusters in cluster_range:
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    clusters = kmeans.fit_predict(X_train_scaled_pca)
    
    silhouette_avg = silhouette_score(X_train_scaled_pca, clusters)
    silhouette_scores_scaled_pca.append(silhouette_avg)

# Find the optimal number of clusters using PCA-transformed data
optimal_clusters_scaled_pca = cluster_range[np.argmax(silhouette_scores_scaled_pca)]

# Visualize the silhouette scores for different cluster numbers with PCA
plt.figure(figsize=(10, 6))
plt.plot(cluster_range, silhouette_scores_scaled_pca, marker='o')
plt.title('Silhouette Score vs. Number of Clusters (with PCA)')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.axvline(x=optimal_clusters_scaled_pca, color='red', linestyle='--', label=f'Optimal Clusters: {optimal_clusters_scaled_pca}')
plt.legend()
plt.show()

print(f'The optimal number of clusters with PCA is: {optimal_clusters_scaled_pca}')
Number of components selected by PCA: 14
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
The optimal number of clusters with PCA is: 2

2.2.2 Determines the optimal number of clusters using the Elbow method¶

The code performs the following steps

  • Dimensionality Reduction with PCA: Reduces the dimensionality of the dataset to two principal components using PCA.
  • Elbow Method for Optimal K: Determines the optimal number of clusters using the Elbow method by fitting K-means models with different values of K and calculating the inertia for each K.
  • K-means Clustering: Performs K-means clustering with the optimal number of clusters obtained from the Elbow method.
  • Visualization of Clusters: Visualizes the clusters in a 2D space using the first two principal components.

The dataset used for the above steps, comprising X_train_no_corr_sqrt and X_test_no_corr_sqrt, was derived from the original dataset. StandardScaler transformation was initially applied to standardize the features, followed by a square root transformation. Subsequently, highly correlated columns were removed from the dataset to mitigate multicollinearity. These preprocessed datasets were then utilized for the subsequent analysis steps.

In [28]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns


# Perform dimensionality reduction for visualization (e.g., using PCA)
pca = PCA(n_components=2)
X_pca_sqrt = pca.fit_transform(X_train_no_corr_sqrt)

# Determine the optimal number of clusters using the Elbow method
inertia = []
for k in range(1, 5):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_train_no_corr_sqrt)
    inertia.append(kmeans.inertia_)

# Plot the Elbow curve
plt.figure(figsize=(8, 6))
plt.plot(range(1, 11), inertia, marker='o')
plt.title('Elbow Method for Optimal K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.show()

# Based on the Elbow method, choose the optimal number of clusters (K)
optimal_k = 2


# Perform K-means clustering
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
cluster_labels = kmeans.fit_predict(X_train_no_corr_sqrt)  # Get cluster labels

# Add cluster labels to the DataFrame
X_train_no_corr_sqrt['cluster'] = cluster_labels

# Visualize the clusters using the first two principal components
plt.figure(figsize=(10, 8))
sns.scatterplot(x=X_pca_sqrt[:, 0], y=X_pca_sqrt[:, 1], hue=X_train_no_corr_sqrt['cluster'], palette='viridis', legend='full')
plt.title('K-means Clustering (2D PCA)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(

2.2.3 K-Means Clustering (Reduced Dimensions)¶

In this code, a comprehensive workflow for clustering analysis is implemented, starting from data preprocessing, followed by dimensionality reduction using PCA, clustering with a specified number of clusters obtained from a previous output, and finally, evaluation and visualization of the clustering results using silhouette score and scatter plot of the reduced dimensions. Specifically, the number of clusters is set to 2, and the number of principal components for PCA is 20, which determined based on previous outputs.

In [24]:
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer


# Impute missing values using SimpleImputer
imputer = SimpleImputer(strategy='mean') 
X_train_imputed = imputer.fit_transform(X_train_no_corr_sqrt)
X_test_imputed = imputer.transform(X_test_no_corr_sqrt)

# Dimensionality reduction using PCA
pca = PCA(n_components=20)
X_train_pca = pca.fit_transform(X_train_imputed)
X_test_pca = pca.transform(X_test_imputed)

# Clustering using K-Means
kmeans = KMeans(n_clusters=2, random_state=42)
X_train_clusters = kmeans.fit_predict(X_train_imputed)

# Evaluate clustering using silhouette score
silhouette_avg = silhouette_score(X_train_imputed, X_train_clusters)
print(f'Silhouette Score: {silhouette_avg}')

# Visualize clusters in reduced dimensions
plt.figure(figsize=(8, 6))
plt.scatter(X_train_pca[:, 0], X_train_pca[:, 1], c=X_train_clusters, cmap='viridis', alpha=0.7)
plt.title('K-Means Clustering (Reduced Dimensions)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1412: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning
  super()._check_params_vs_input(X, default_n_init=10)
D:\Karthika University\Python\Program\Lib\site-packages\sklearn\cluster\_kmeans.py:1436: UserWarning: KMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can avoid it by setting the environment variable OMP_NUM_THREADS=2.
  warnings.warn(
Silhouette Score: 0.43147191367929816
Out[24]:
Text(0, 0.5, 'Principal Component 2')

2.3 Conclusion¶

Based on the observation that the first dataset achieved the highest silhouette score of 0.44 and utilized the highest number of principal components (20), it appears to be the most informative dataset for clustering analysis. Therefore, using this dataset for determining the optimal number of clusters through the Elbow method and K-Means Clustering (Reduced Dimensions) is reasonable.

After applying PCA with 20 components, the silhouette score obtained is approximately 0.43, indicating a reasonable clustering structure. Additionally, the Elbow method, when applied to this dataset, also suggests a cluster number of 2.

In conclusion, the first dataset, with preprocessing steps including square root transformation and removal of highly correlated columns, provides the most robust basis for clustering analysis, as it yields the highest silhouette score and effectively guides the determination of the optimal number of clusters through both PCA and the Elbow method.

3.1 Introduction¶

In this analysis phase, we aim to utilize supervised machine learning techniques to classify breast cancer cases as benign or malignant and predict numerical features. Leveraging insights from pre-processing, exploratory data analysis (EDA), and unsupervised learning, our goal is to build robust models that accurately capture underlying data patterns. By employing various classification algorithms and regression techniques, we will assess model performance using appropriate evaluation metrics such as accuracy, precision, recall, F1-score, mean squared error (MSE), or R-squared. This integrated approach seeks to enhance decision-making in breast cancer diagnosis and prognosis by effectively capturing complex relationships within the dataset.

3.2 Methodology( Classification and Regression)¶

3.2.1 The analysis provides an evaluation of various classifiers on the dataset¶

My code performs the following tasks:

  • DataFrame Reconstruction: It reconstructs the DataFrame from the NumPy array X_train_pca_no_corr_sqrt, assuming it contains the features of interest.

  • Data Alignment: It ensures that the indices of the DataFrame containing features (X) and the target variable (y) are aligned. This alignment guarantees that they can be correctly split and utilized for training and testing purposes.

  • Conversion to NumPy Arrays: It converts the pandas DataFrames to NumPy arrays if they are DataFrame objects. This conversion is necessary because scikit-learn models typically work with NumPy arrays.

  • Classifier Initialization: It initializes several classifiers, including Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, SVM, KNN, and Naive Bayes. These classifiers will be trained and evaluated on the dataset.

  • Training and Evaluation: It iterates over each classifier, fits it to the training data, makes predictions on the test data, and evaluates the classifier's performance using metrics such as accuracy, confusion matrix, and classification report.

  • Printing Results: It prints out the accuracy, confusion matrix, and classification report for each classifier. This allows for a comprehensive comparison of the classifiers' effectiveness in solving the classification problem at hand.

In [50]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Recreate the DataFrame from the NumPy array without specifying the index
X_train_pca_no_corr_sqrt_df = pd.DataFrame(X_train_pca_no_corr_sqrt)

# Assuming 'y' is also a DataFrame, you may need to adjust accordingly if it's a Series
common_indices = y.index.intersection(X_train_pca_no_corr_sqrt_df.index)
X_train_pca_no_corr_sqrt_df = X_train_pca_no_corr_sqrt_df.loc[common_indices]
y = y.loc[common_indices]

X_train, X_test, y_train, y_test = train_test_split(X_train_pca_no_corr_sqrt_df, y, test_size=0.2, random_state=42, stratify=y)

# Convert to NumPy arrays if using pandas DataFrames
X_train = X_train.values if isinstance(X_train, pd.DataFrame) else X_train
X_test = X_test.values if isinstance(X_test, pd.DataFrame) else X_test

# Initialize classifiers
classifiers = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier(),
    'Naive Bayes': GaussianNB()
}

# Train and evaluate classifiers
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    print(f"Classifier: {name}")
    print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
    print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}")
    print(f"Classification Report:\n{classification_report(y_test, y_pred)}")
    print("\n")
Classifier: Logistic Regression
Accuracy: 0.49230769230769234
Confusion Matrix:
[[28  8]
 [25  4]]
Classification Report:
              precision    recall  f1-score   support

       False       0.53      0.78      0.63        36
        True       0.33      0.14      0.20        29

    accuracy                           0.49        65
   macro avg       0.43      0.46      0.41        65
weighted avg       0.44      0.49      0.44        65



Classifier: Decision Tree
Accuracy: 0.5384615384615384
Confusion Matrix:
[[18 18]
 [12 17]]
Classification Report:
              precision    recall  f1-score   support

       False       0.60      0.50      0.55        36
        True       0.49      0.59      0.53        29

    accuracy                           0.54        65
   macro avg       0.54      0.54      0.54        65
weighted avg       0.55      0.54      0.54        65



Classifier: Random Forest
Accuracy: 0.46153846153846156
Confusion Matrix:
[[21 15]
 [20  9]]
Classification Report:
              precision    recall  f1-score   support

       False       0.51      0.58      0.55        36
        True       0.38      0.31      0.34        29

    accuracy                           0.46        65
   macro avg       0.44      0.45      0.44        65
weighted avg       0.45      0.46      0.45        65



Classifier: Gradient Boosting
Accuracy: 0.4
Confusion Matrix:
[[17 19]
 [20  9]]
Classification Report:
              precision    recall  f1-score   support

       False       0.46      0.47      0.47        36
        True       0.32      0.31      0.32        29

    accuracy                           0.40        65
   macro avg       0.39      0.39      0.39        65
weighted avg       0.40      0.40      0.40        65



Classifier: SVM
Accuracy: 0.5538461538461539
Confusion Matrix:
[[33  3]
 [26  3]]
Classification Report:
              precision    recall  f1-score   support

       False       0.56      0.92      0.69        36
        True       0.50      0.10      0.17        29

    accuracy                           0.55        65
   macro avg       0.53      0.51      0.43        65
weighted avg       0.53      0.55      0.46        65



Classifier: KNN
Accuracy: 0.47692307692307695
Confusion Matrix:
[[21 15]
 [19 10]]
Classification Report:
              precision    recall  f1-score   support

       False       0.53      0.58      0.55        36
        True       0.40      0.34      0.37        29

    accuracy                           0.48        65
   macro avg       0.46      0.46      0.46        65
weighted avg       0.47      0.48      0.47        65



Classifier: Naive Bayes
Accuracy: 0.5384615384615384
Confusion Matrix:
[[15 21]
 [ 9 20]]
Classification Report:
              precision    recall  f1-score   support

       False       0.62      0.42      0.50        36
        True       0.49      0.69      0.57        29

    accuracy                           0.54        65
   macro avg       0.56      0.55      0.54        65
weighted avg       0.56      0.54      0.53        65



3.2.2 Evaluation of Logistic Regression¶

Here's a breakdown of the workflow implemented in my code:

  • Data Preparation: The DataFrame X_train_pca_no_corr_sqrt_df is reconstructed from the NumPy array X_train_pca_no_corr_sqrt. The indices of the X_train_pca_no_corr_sqrt_df DataFrame and the target variable y are aligned to ensure they match for proper data splitting.

  • A Logistic Regression classifier (LogisticRegression) is initialized. The classifier is trained using the training data (X_train and y_train) via the fit method. Prediction:

  • Predictions are made on the test set (X_test) using the trained logistic regression model. The predicted labels (y_pred) are obtained for the test set. Evaluation Metrics Calculation:

  • The predicted labels (y_pred) and true labels (y_test) are converted to numeric values (0 and 1) to calculate certain evaluation metrics. Evaluation metrics calculated include Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) using functions from sklearn.metrics. Model Evaluation:

  • Various evaluation metrics are printed to assess the performance of the logistic regression classifier. These include accuracy, MSE, RMSE, MAE, confusion matrix, and classification report. This workflow provides a systematic approach to building, training, and evaluating a logistic regression model for classification tasks.

In [51]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Recreate the DataFrame from the NumPy array without specifying the index
X_train_pca_no_corr_sqrt_df = pd.DataFrame(X_train_pca_no_corr_sqrt)

# Assuming 'y' is also a DataFrame, may need to adjust accordingly if it's a Series
common_indices = y.index.intersection(X_train_pca_no_corr_sqrt_df.index)
X_train_pca_no_corr_sqrt_df = X_train_pca_no_corr_sqrt_df.loc[common_indices]
y = y.loc[common_indices]

X_train, X_test, y_train, y_test = train_test_split(X_train_pca_no_corr_sqrt_df, y, test_size=0.2, random_state=42, stratify=y)

# Convert to NumPy arrays if using pandas DataFrames
X_train = X_train.values if isinstance(X_train, pd.DataFrame) else X_train
X_test = X_test.values if isinstance(X_test, pd.DataFrame) else X_test

# Initialize Logistic Regression classifier
clf = LogisticRegression()

# Train the Logistic Regression classifier
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Convert boolean arrays to numeric values (0 and 1)
y_test_numeric = y_test.astype(int)
y_pred_numeric = y_pred.astype(int)

# Calculate evaluation metrics
mse = mean_squared_error(y_test_numeric, y_pred_numeric)
rmse = mean_squared_error(y_test_numeric, y_pred_numeric, squared=False)
mae = mean_absolute_error(y_test_numeric, y_pred_numeric)

# Evaluate the Logistic Regression classifier
print("Classifier: Logistic Regression")
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}")
print(f"Classification Report:\n{classification_report(y_test, y_pred)}")
Classifier: Logistic Regression
Accuracy: 0.49230769230769234
Mean Squared Error (MSE): 0.51
Root Mean Squared Error (RMSE): 0.71
Mean Absolute Error (MAE): 0.51
Confusion Matrix:
[[28  8]
 [25  4]]
Classification Report:
              precision    recall  f1-score   support

       False       0.53      0.78      0.63        36
        True       0.33      0.14      0.20        29

    accuracy                           0.49        65
   macro avg       0.43      0.46      0.41        65
weighted avg       0.44      0.49      0.44        65

3.2.3 This code calculates and visualizes the Receiver Operating Characteristic (ROC) curve for a binary classification model.¶

Here's a breakdown of what each part does

  • Import Libraries: from sklearn.metrics import roc_auc_score, roc_curve: Imports the functions needed to calculate the ROC curve and its area under the curve (AUC). import matplotlib.pyplot as plt: Imports Matplotlib for visualization.

  • Calculate Predicted Probabilities: y_prob = clf.predict_proba(X_test)[:, 1]: Predicts the probability of the positive class for each sample in the test set using the trained classifier (clf).

  • Compute AUC-ROC Score: roc_auc = roc_auc_score(y_test, y_prob): Calculates the AUC-ROC score using the true labels (y_test) and predicted probabilities (y_prob).

  • Compute ROC Curve: fpr, tpr, _ = roc_curve(y_test, y_prob): Computes the ROC curve by calculating the false positive rate (fpr) and true positive rate (tpr) at various thresholds.

  • Visualize ROC Curve: Plots the ROC curve using Matplotlib. plt.figure(figsize=(8, 6)): Sets the figure size. plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'AUC = {roc_auc:.2f}'): Plots the ROC curve with the AUC value as a label. plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--'): Plots the diagonal line representing random guessing. Sets the axis labels, title, and legend. plt.show(): Displays the plot.

Overall, this code snippet provides a concise way to evaluate the performance of a binary classification model using the ROC curve and AUC-ROC score.

In [42]:
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# Calculate predicted probabilities for the positive class
y_prob = clf.predict_proba(X_test)[:, 1]

# Compute AUC-ROC score
roc_auc = roc_auc_score(y_test, y_prob)
print(f"AUC-ROC: {roc_auc}")

# Compute ROC curve
fpr, tpr, _ = roc_curve(y_test, y_prob)

# Visualize ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'AUC = {roc_auc:.2f}')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR) or Recall')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.show()
AUC-ROC: 0.5507662835249042

3.3 Conclusion¶

The analysis indeed provides a conclusion for the classifiers' performance. It evaluates each classifier's performance using various metrics such as accuracy, mean squared error, confusion matrix, and classification report. Based on these metrics, conclusions can be drawn regarding how well each classifier performs in solving the classification problem at hand.

  • Logistic Regression: Achieved an accuracy of approximately 49.23%. While it demonstrated relatively high precision (53%) for the negative class (False), its recall for the negative class was notably higher (78%), indicating it is better at identifying negative instances. However, its performance for the positive class (True) was weaker, with lower precision (33%) and recall (14%).

  • Decision Tree: Achieved an accuracy of around 53.85%. It demonstrated balanced performance with precision and recall scores around 50% to 60% for both classes.

  • Random Forest: Attained an accuracy of approximately 46.15%. Similar to the decision tree, it showed balanced precision and recall scores for both classes but with slightly lower overall accuracy.

  • Gradient Boosting: Achieved an accuracy of 40%. While it demonstrated balanced precision and recall scores, its overall accuracy was lower compared to other classifiers.

  • Support Vector Machine (SVM): Attained an accuracy of approximately 55.38%. It showed higher precision and recall for the negative class (False) but lower precision and recall for the positive class (True).

  • K-Nearest Neighbors (KNN): Achieved an accuracy of around 47.69%. It demonstrated slightly imbalanced performance, with higher precision and recall for the negative class (False) compared to the positive class (True).

  • Naive Bayes: Achieved an accuracy of approximately 53.85%. It showed better precision and recall for the positive class (True) compared to the negative class (False).

The conclusion for the Logistic Regression classifier:

  • Accuracy: The Logistic Regression classifier achieved an accuracy of approximately 49.23%. This indicates that around 49.23% of the predictions made by the model were correct.

  • Error Metrics:

    • Mean Squared Error (MSE): The average squared difference between the actual and predicted values is 0.51.
    • Root Mean Squared Error (RMSE): The RMSE, which gives the average deviation of the predictions from the actual values, is approximately 0.71.
    • Mean Absolute Error (MAE): The average absolute difference between the actual and predicted values is 0.51.
  • Confusion Matrix: The confusion matrix shows the model's performance in terms of true positive, true negative, false positive, and false negative predictions. It indicates that out of 36 instances of the negative class, 28 were correctly predicted (true negatives), and out of 29 instances of the positive class, only 4 were correctly predicted (true positives). However, there were 8 false positives and 25 false negatives.

  • Classification Report:

    • Precision: The precision for the negative class (False) is 53%, meaning that out of all instances predicted as negative, 53% were actually negative. For the positive class (True), the precision is 33%, indicating that only 33% of the instances predicted as positive were actually positive.
    • Recall: The recall for the negative class is 78%, indicating that 78% of all actual negative instances were correctly classified. However, the recall for the positive class is only 14%, indicating that only 14% of all actual positive instances were correctly classified.
    • F1-score: The F1-score, which is the harmonic mean of precision and recall, is 0.63 for the negative class and 0.20 for the positive class.
    • Support: This represents the number of occurrences of each class in the dataset.
  • Conclusion: The Logistic Regression model shows limited performance, especially in correctly identifying instances of the positive class (True). Its precision, recall, and F1-score for the positive class are relatively low compared to the negative class. Further improvements may be needed, such as feature engineering, hyperparameter tuning, or exploring different algorithms, to enhance its performance.

The conclusion forAUC-ROC score

The AUC-ROC score of approximately 0.55 suggests that the classifier's performance in distinguishing between the positive and negative classes is slightly better than random chance. While this indicates some discriminatory power, it also implies that the classifier may not be highly effective in accurately classifying instances.

In practical terms, an AUC-ROC score of 0.55 indicates that the classifier's ability to differentiate between the two classes is only marginally better than random guessing. Therefore, further analysis and potentially model refinement may be necessary to enhance its performance and achieve better classification results.

Summary Discussion Limitations¶

The analysis presents a detailed evaluation of various classifiers for clustering and classification tasks, highlighting their strengths and limitations.

For clustering, the first dataset, after preprocessing and PCA, showed promising results with a high silhouette score and clear indications from both PCA and the Elbow method for the optimal number of clusters. However, it's important to note that while silhouette scores and PCA can provide insights, they do not guarantee the perfect clustering solution. Other clustering algorithms and evaluation methods could be explored to validate the findings further.

In classification, the Logistic Regression model exhibited limited performance, particularly in correctly identifying instances of the positive class. Despite achieving a moderate accuracy, its precision, recall, and F1-score for the positive class were notably lower than for the negative class. This suggests potential class imbalance issues or inadequacies in feature representation. Additionally, the AUC-ROC score marginally exceeding 0.5 indicates slight discriminatory power but falls short of achieving robust classification performance.

Limitations of the analysis include the reliance on a specific set of preprocessing steps and algorithms. Different preprocessing techniques or algorithms could yield different results. Furthermore, the evaluation metrics used provide insights into model performance but may not capture all aspects of the problem domain. For instance, in imbalanced datasets, accuracy alone may not be a sufficient metric for performance evaluation.

Moreover, the conclusions drawn from the analysis should be interpreted cautiously and validated through further experimentation, including cross-validation, parameter tuning, and potentially exploring ensemble methods or more advanced algorithms.

In practical applications, it's crucial to recognize these limitations and conduct thorough model validation and refinement to ensure the reliability and effectiveness of the chosen classification approach.

Reference¶

  • https://towardsdatascience.com/a-gentle-introduction-to-exploratory-data-analysis-f11d843b8184
  • https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6
  • https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
  • https://www.analyticsvidhya.com/blog/2022/07/step-by-step-exploratory-data-analysis-eda-using-python/
  • https://bookdown.org/jhvdz1/ml2/preproc.html
  • https://dl.acm.org/doi/10.1145/3603709
  • https://www.codecademy.com/article/eda-prior-to-unsupervised-clustering
  • https://link.springer.com/article/10.1007/s41060-022-00346-9
  • https://www.sciencedirect.com/science/article/pii/S2666546822000441
In [ ]: